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Abstract 

We present a case study describing efforts to optimise and modernise “Modal”, the simulation and analysis 
pipeline used by the Planck satellite experiment for constraining general non-Gaussian models of the early 
universe via the bispectrum (or three-point correlator) of the cosmic microwave background radiation. We 
focus on one particular element of the code: the projection of bispectra from the end of inflation to the 
spherical shell at decoupling, which defines the CMB we observe today. This code involves a three- 
dimensional inner product between two functions, one of which requires an integral, on a non-rectangular 
domain containing a sparse grid. We show that by employing separable methods this calculation can be 
reduced to a one-dimensional summation plus two integrations, reducing the overall dimensionality from 
four to three. The introduction of separable functions also solves the issue of the non-rectangular sparse 
grid. This separable method can become unstable in certain scenarios and so the slower non-separable 
integral must be calculated instead. We present a discussion of the optimisation of both approaches. 

We demonstrate significant speed-ups of »:100x , arising from a combination of algorithmic improve¬ 
ments and architecture-aware optimisations targeted at improving thread and vectorisation behaviour. The 
resulting MPI/OpenMP hybrid code is capable of executing on clusters containing processors and/or co¬ 
processors, with strong-scaling efficiency of 98.6% on up fo 16 nodes. We find that a single coprocessor 
outperforms two processor sockets by a factor of 1.3x and that running the same code across a com¬ 
bination of both microarchitectures improves performance-per-node by a factor of 3.38x . By making 
bispectrum calculations competitive with those for the power spectrum (or two-point correlator) we are 
now able to consider joint analysis for cosmological science exploitation of new data. 


1 Introduction 


The current best explanation for the origin of our universe is the inflationary big bang scenario, where it 
is believed that a period of exponential expansion created the large flat empty universe we see today. In 
addition, this model predicts that quantum fluctuations in the energy during this time will be stretched to 
galactic scales forming the seeds from which all structure grew, from planets and stars through to super¬ 
clusters of galaxies. The statistics of these fluctuations give a window onto the dynamics at play during 
the birth of our universe. In particular, any deviation of these fluctuations from a Gaussian (i.e. Normal) 
distribution would be direct evidence of interesting new physics. 

There has been enormous effort within the community to measure any possible deviations from Gaus- 
sianity with the bispectrum, the Fourier transform of the three point correlator, being the favoured statistic. 
The primary obstacle to naive estimation of the bispectrum is that for the CMB it is 5 dimensionaQ and 
would require (9(10^^) floating point operations to calculate, which is challenging for the world’s largest 
supercomputers. This can be overcome by using separable approximations for the bispectra, however the 
projection of the primordial bispectra forward to the time of observation remains a major obstacle to mea¬ 
surement. There are a multitude of approaches to this which divide into two main categories: ones that 
require non-Gaussian simulations to train estimators, M; and those that use specific simple primordial 
templates for which projection is tractable, M- 

The Modal method 111 1^ developed at the University of Cambridge, which is the focus of this paper, 
and used by the Planck satellite experiment miE) is the only general method for constraining these 


hhis is because it is the average of three vector quantities on a 2D surface (the CMB) which gives you 6 dimensions, enforcing 
momentum conservation then removes one of these. 
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non-Gaussianities from the available data. Its main strength is that by using a general mode expansion it 
can reduce the evolution from primordial to late times into a matrix projection, allowing us to constrain 
thousands of theoretical predictions simultaneously. By using an appropriate basis tuned for the theoretical 
models of interest, the Modal team have created a fast and efficient way to probe cosmological data for hints 
of new physics in the early universe. 

This paper investigates the optimisation and modernisation of Modal, as part of an effort to accelerate 
it using Intel® Xeon Phi coprocessors. The existing MPI-level parallelism in the original code is not suf¬ 
ficient to enable efficient utilisation of this hardware, and we show that moving to a hybrid MPI/OpenMP 
implementation can significantly improve performance. For portability reasons, we avoid making any sig¬ 
nificant code changes that would benefit only one particular hardware platform, and thus the high-level 
programming languages and techniques that we use apply equally well to Intel® Xeon® processors. When 
compared to the original code on 2x processor sockets, our code optimisation efforts deliver speed-ups of 
1765x on a single coprocessor and 833x on 2x processor sockets in the 2D case; and 108x on a coproces¬ 
sor and 83.9x on 2x processor sockets in the 3D case. These speed-ups are large enough to significantly 
impact the rate of scientific discovery at COSMOS, and enable liberal use of the Modal calculation as part 
of future Monte Carlo pipelines - something that had previously been considered infeasible. 

A number of previous studies have investigated the use of Intel Xeon Phi coprocessors to accelerate 
other scientific codes 115 ■^, and there are many similarities between the optimisations we discuss here 
and those explored in other domains. However, we note that many of these studies were performed before 
the standardisation of OpenMP 4.0 (and thus often rely on manual vectorisation via architecture-specific 
intrinsics). This is also the first paper (to the best of our knowledge) to explore the use of Intel Xeon Phi 
coprocessors for this specific application domain. 

The rest of this paper is organized as follows: Section provides an introduction to the two Modal 
routines which are being optimised - the full three dimensional calculation on the sparse non-rectangular 
domain, and the fast two dimensional version on a dense rectangular domain - and also provides a high- 
level introduction to Intel Xeon Phi coprocessors; Section [^details the optimisation and modernisation of 
Modal; Sectionj^presents a detailed performance study of the final application, demonstrating its scalability 
within a node and across multiple nodes; and finally, Sectionj^concludes the paper, and discusses potential 
new uses for the accelerated version of the code. 


2 Background 

2.1 Direct Integration 

The primary concern of this paper is the efficient calculation of a three-dimensional inner product which 
concerns the projection of a set of primordial basis bispectra defined at the end of inflation into a set of basis 
bispectra on a spherical shell defined by the CMB; that is, we evolve from an early-time basis into another 
different basis which is more convenient at late times. We consider two late-time bispectra and 

depending on the spherical harmonic multipoles and we take the following inner product betweeen them: 
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The array in brackets is the Wigner 37 symbol, which is a geometric factor related to the projection onto 
the 2-sphere of the CMB. It has two important properties. Firstly, it is zero if the three 7,, when treated as 
lengths, are unable to form a triangle (the triangle condition) and secondly it is zero whenever ti + {2 + 
is odd (the parity condition). These two conditions present complications in evaluating the sums over the 
li as the region is non-rectangular and inside the allowed region all odd combinations must be excluded. 
These two constraints present issues with load balancing and vectorisation, which will be discussed later. 
When both these conditions are met, has an exact expression in terms of factorials which in turn can be 
accurately calculated using the Gosper approximation to the Wigner 3J symbol. The resulting expression 
is: 
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where L - C\ + (2 + (z and Lj - L - If,. This inner product needs to be calculated between our projected 
primordial basis Q, which unavoidably involves a radial integral along a line of sight from primordial 
times until now, and our late time basis Q. For the estimation and projection to be tractable we must form 
our basis functions from products of one-dimensional basis functions q and q respectively which must be 
symmetrised over the f,. Writing it out explicitly in terms of these ID functions; 
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where 0 is a top-hat like function (which is I when the triangle and parity conditions are met and 0 else¬ 
where) and there is a known mapping between n —» / jk. Evaluation of Equation (where Vf, C( and q 
have been precomputed) is performed by “Modal3D” (note that as we have a radial integral in addition to 
the three summations, the calculation is in fact four-dimensional for each of the independent matrix 
entries). Eor full details of the origin of Equation see the Appendix. Typical values for the problem size 
are n„jax ~ 2000, {max ~ 2000 and approximately 200 points needed to calculate the integral over r using 
splines. 


2.2 Separable Integration 


One significant simplification that can be made in some cases is to note that the Wigner 37 symbol can be 
written as an integral over three Legendre polynomials so in Equation [^takes the exact form: 
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This automatically preserves the triangle and parity conditions allowing us to work on a domain which is 
both rectangular and dense. This allows us to write Equation|^in a much simpler form: 

^ f + 5 perms') , (6) 

where we have made the definition: 
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The simplified expression in Equation]^ is solved by “Modal2D”. One consequence of trying to retain the 
step like conditions of while using an integral form is that this calculation is very sensitive and must 
be done to very high precision. Eortunately as the integrand is polynomial we can use Gauss-Legendre 
quadrature which is, in principle, exact. However we found that standard libraries for calculation of weights 
and abscissas were not sufficiently accurate and we had to use a specialist implementation which had been 
optimised for this purpose 1201 before the calculation proved stable for specific choices for Q and Q. We 
note that this stability is not sufficient to allow us to reverse the order of the r and p integrations and the p 
integration must always be carried out first. 

It is this issue with stability which leads us to retaining both implementations of the calculation. The 
2D version is employed for any bispectra where bases can be chosen so that the integral representation of 
is stable, with any remaining cases being calculated by the slower but robust 3D version. We found that 
the majority of bispectra we are interested in can use the 2D version but there are some very well-motivated 
cases that cannot, for example the bispectrum induced by gravitational lensing. 

The original code is written in C and parallelized with MPI. The 2D variant is parallelized over the 
loop of all the inner products whereas the 3D variant is parallelized over the n and {\ indices in 

Equation]^ Throughout the rest of this paper, “Modal” without a 2D or 3D suffix can be assumed to refer 
to both variants of the application. 


2.3 Intel Xeon Phi Coprocessors 

An Intel Xeon Phi coprocessor consists of many (=s 60) low frequency in-order cores which share a co¬ 
herent memory. Each of these cores can run up to four hardware threads, which is useful for hiding the 
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Table 1; System configuration for a single node of Cosmic. 



Intel Xeon Processor E5-4650L 

Intel Xeon Phi Coprocessor SHOP 

SocketsxCoresxThreads 

1x8x1 

1 X 60 X 4 

Clock (GHz) 

2.6 

1.053 

Double Precision GELOP/s 

166.4 

1010.88 

L1/L2/L3 Cache (KB) 

32 / 256 / 20,480 

32/512/- 

DRAM (per node) 

58 GB 

8 GB GDDR 

STREAM Bandwidth 

26.4 GB/s 

165 GB/s 

PCIe Bandwidth 

6 GB/s 

Compiler Version 

icc vl5.0.0.090 

MPI Version 

SGI MPT 2.10 

MPSS Version 

3.2.1 


latencies of memory accesses and multi-cycle instructions in the absence of out-of-order execution. Unlike 
the “hyperthreads” on Intel Xeon processors, a single thread on the coprocesor cannot issue instructions 
on back-to-back cycles - it is therefore necessary to use at least two threads per core to fully utilize the 
hardware. 

Each core additionally has support for 512-bit SIMD instructions from the Initial Many-core Instruction 
(IMCI) set. Executing a fully vectorized fused multiply add (EMA) every cycle amounts to 16 double 
precision ELOPs per cycle per core, or a theoretical peak of ail TELOP/s in double precision. 

The coprocessor is physically mounted on a PCIe card with its own GDDR memory and Linux operating 
system. Since the cores are x86-based and run Linux, the coprocessor is amenable to existing parallel pro¬ 
gramming languages and libraries such as MPI and OpenMP - compiling existing codes for the coprocessor 
is often very simple, but extracting performance may require some significant algorithmic restructuring to 
expose sufficient parallelism ID- 

2.4 Experimental Setup 

We use the Cosmic supercomputer at the University of Cambridge, which is an SGI UV2000 system con¬ 
sisting of 28 processors, 24 coprocessors, and 1.6 TB of RAM. The specification of a single node of Cosmic 
is given in Table[T] The host processors are based on the microarchitecture previously code-named “Sandy 
Bridge”, and the coprocessors on the microarchitecture previously code-named “Knights Corner”. Eor the 
sake of brevity, we refer to these two microarchitectures henceforth as “SNB” and “KNC”, respectively. 

All experiments (unless otherwise stated) were performed using all of the cores available within a node, 
running the maximum number of threads supported - 4 threads on KNC and 1 thread on SNB (hyper¬ 
threading is disabled on the host) - and the thread affinity used on KNC is set using: KMPJVFFINITY= 
close, granularity=fine. KNC is used in offload mode, and no additional flags are passed to the com¬ 
piler beyond those used for SNB: -03 -xHost -mcmodel^medium -restrict -align -£no-alias 
-qopenmp. All experiments are repeated 5 times, and we present the average (mean), in order to account 
for system noise and any non-deterministic performance effects arising from threading. 

Profiling and analysis of the code was performed using Intel® VTune Amplifier XE and the optimi¬ 
sation reporting functionality of the Intel® C compiler. Eor both forms of the algorithm, we measure the 
performance in terms of the number of loop iterations executed by the whole code per second. Since the 
amount of work per loop iteration is different in the two cases, we differentiate between them by referring 
to them as 2D and 3D iterations per second (2D it/s and 3D it/s) respectively. Henceforth we will also adopt 
a convenient notation for displaying the performance figures for SNB and KNC side-by-side as a tuple - 
(SNB, KNC) it/s. 

3 Optimisation and Modernisation 

Prior to exploring any algorithmic changes, hardware-specific optimisations or code modernisations, we 
carried out some high-level refactoring of the two Modal variants to improve their performance. Specif¬ 
ically, we replaced a number of abstract “getter” functions used to retrieve array values from other com¬ 
pilation modules with inlined direct array accesses, removing function call overheads and providing the 
compiler with more optimisation freedom. Some occurrences of routines from the GNU Scientific Library 
(GSL) were also replaced with equivalent, pre-optimised, routines from the Intel® Math Kernel Library 
(MKL). 
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Listing 1: Code snippet for gainma_pt in Modal2D. 


double gamma.pt ( int m, int n, int i) { 

for ( j =0; j <npt ; j ++) { 

for ( 1 =0; 1 <1 size ; 1 ++) { 

factor [1] = lweight[l]*gl_Pl[j][l]; 

I 

for(r=0;r<3;r ++) { 

for ( s =0; s <3; s ++) { 
suml = 0.0; 

for ( 1 =0; 1 < 1 size ; 1 ++) | 

suml += f ac tor [ 1 ] * basis [ pvec [ r ] ] [ 1 ] * beta [ i ] [ qvec [ s ] ] [ 1 ] ; 

I 

Nmap [r][s] = suml; 

I 

I 

s 1 = Nmap [ 0 ] [ 0 ] * Nmap [ 1 ] [ 1 ] * Nmap [ 2 ] [ 2 ]; 

s2 = Nmap[0][0]*Nmap[l][2]*Nmap[2][l]; 
s3 = Nmap[0][l]*Nmap[l][0]*Nmap[2][2]; 
s4 = Nmap [ 0 ] [ 1 ] * Nmap [ 1 ] [ 2 ] * Nmap [ 2 ] [ 0 ]; 

s 5 = Nmap [ 0 ] [ 2 ] * Nmap [ 1 ] [ 0 ] * Nmap [ 2 ] [ 1 ]; 

s 6 = Nmap [ 0 ] [ 2 ] * Nmap [ 1 ] [ 1 ] * Nmap [ 2 ] [ 0 ] ; 

suml = s 1 + s2 + s3 + s4 + s5 + s6 ; 
result += gl _ wgt [ j ] * suml ; 


return result; 
I 


The purpose of performing such a high-level refactoring before engaging in other optimisation activities 
is twofold; first, it ensures that the baseline performance of the code is representative of the performance 
of the original algorithm (as opposed to only its implementation); second, it exposes more accurately the 
incremental performance improvements from our subsequent changes to the code. After our refactoring, the 
SNB performance of the 3D variant of the algorithm increased by 3.27x , however the 2D variant showed 
negligible improvement. 

3.1 Modal2D 

The hotspot in the Modal2D code is in a function called gamina_pt (see Listing which evaluates the 
inner-most integral over p in Equation via Gauss-Legendre (GL) quadrature, which we repeat here for 
clarity; 

^ ^ f ^ perms) (8) 

(2/’ +1) 

Pw (L P)=Y - -(L {)qi({)Pdp) ■ (9) 

f vtyCt 

The indexes mn are the element of T' being calculated with the index i corresponding to the point in the 
r integration. The j loop is over the GL-quadrature points for p, s cycles over ijk and r over i' fk'. The 
array Nmap is P;,', with beta corresponding to q and basis to q. The array factor contains the product of 
all functions solely of { and the Legendre polynomial P(. The il-6 are the six permutations of the product 
of the Put required by symmetry and finally gl_wgt is the GL-quadrature weight. Since the size of L is 
(9(1000^), there is a significant amount of exploitable parallelism present. 

3.1.1 Threading 

The original code has the loop over mn points as a nested loop (n as the outer loop, m as the inner loop), 
and distributes the outer loop across processors using MPI. We opt to extend this by distributing both the 
inner loop and the MPI subdomain of the outer loop across cores within a node using OpenMP. Since the 
amount of work performed for each iteration of the m loop is the same, and the calculation of each mn 
point is separable and independent of all others, this is achieved simply through the application of an omp 
parallel for collapse (2) pragma. 
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Figure 1; Performance improvement of Modal2D from optimisation and modernisation. The numbers 
above each of the bars are the performance difference between a coprocessor and two processors for that 
version. 


3.1.2 Vectorisation 

The code in its original form does not express the algorithm in a way that is conducive to compiler auto- 
vectorisation, and the Intel® compiler is unable to vectorise the rsl loop nest because of the indirection 
arising from the pvec and qvec arrays. Since the rs loop nest has a hard-coded trip count of 9 iterations, 
it is practical to simply unroll this loop nest and compute the 9 sums simultaneously, thereby replacing the 
pvec and qvec with scalars and removing the indirection. Loop unrolling also gives the compiler more 
freedom to re-order the instructions, which is particularly important for an in-order core like that of KNC. 
Following these code transformations the compiler is able to auto-vectorise the loop, resulting in speed-ups 
of 2.65x on SNB and 7.0x on KNC - performance of (5.2, 11.32) 2D it/s. Tuning the data layout to assist 
vectorisation (i.e. ensuring data alignment) and hoisting some repeated calculations from the inner-most 
loop increases performance further, to (5.3, 12.0) 2D it/s. 

3.1.3 Algorithmic Improvements 

A significant issue with the original algorithm is that the value P^,(x,j(i) is calculated repeatedly from Equa¬ 
tion H Looking at the sizes of the dimensions of Equation^ - 0 < / < ^N,erms^ 0 < x < 216, and 
Q<fi< 3000 - if we were to pre-calculate P,-,/ for all values of /, jj., and x then the resultant array would 
require only 1 GB of storage (in double precision), which is small enough to remain within the 8 GB of 
GDDR available on KNC. Pre-computing P„v in this way reduces algorithmic complexity signihcantly, 
yielding a dramatic speed-ups of 278x on KNC and 300x on SNB - giving a new performance of (1649.3, 
3344.5) 2D it/s. 

Eollowing the introduction of the new algorithm, it is necessary to re-examine the hotspots and any 
assumptions about achievable performance. Since the size of the look-up table we introduce is larger 
than the size of cache, and its access pattern is too irregular to benefit from blocking, our changes shift 
Modal2D from compute- to memory-bound. Eurthermore, running the full number of threads per core 
causes contention for entries in each core’s Translation Look-aside Buffer (TLB) - somewhat counter¬ 
intuitively, we can increase performance by a further 5% by reducing the number of threads used per core 
to 2. 

3.1.4 Results 

The graph in Eigure shows the speed-up resulting from each of our optimisations and modernisations. 
Eor the original code following re-factoring and the introduction of threads (Version 1) we see that KNC 
is out-performed by two SNB sockets. However, tuning to ensure efficient use of vectors (Versions 2-4) 
is sufficient to invert this relationship, and highlights the importance of achieving high SIMD efficiency 
on KNC. The most signihcant speed-ups arise from algorithmic change (Version 5) on both processor and 
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Table 2: Execution times of Modal2D at various stages of code optimisation. 


Version 

2x SNB (s) 

lx KNC (s) 

Comment 

1 

182005.17 

225028.22 

Original code. 

2 

68748.59 

31906.09 

Unroll and vectorize loop. 

3 

68628.19 

31304.09 

Memory alignment. 

4 

67303.79 

30100.08 

Remove repeated calculations. 

5 

219.0 

108.0 

New algorithm. 

6 

219.0 

103.33 

Reduce number of threads per core to 2 on KNC. 


Listing 2: Code snippet for the hotspot in ModalSD. 


for(n = 0; n<terms ; n++) { 

for(ll=0; ll<lsize; 11++) | 

for(m=0; nKterms ; m++) mvec [m] = 0.0; 
for(12 = ll ; 12<lsize ; 12++) { 

for(13 = 12 + ll%2; 13 <min( 11 +12 , Imax ) +1; 13+=2) { 

X = calculate_xint(ll,12,13,n,xsize,xvec,yvec,task); 

z = permsix(ll , 12 , 13 ) * c ale u 1 at e _ge ome trie ( 11 , 12 , 13 ) / s qrt ( s 1 * s2 * s3 ) ; 
for (m=0;iTKterms ;m++) | 

y = plijk (m, 11 , 12 , 13 ) ; 
mvee [m] += x*y*z; 


// array reduction of mvec into gamma matrix 

I 


coprocessor. On KNC we also show that using less threads per core can give a boost in performance for 
this case (Version 6). (See Table]^) 

3.2 Modal3D 

In the original ModaOD code we perform a brute force computation of Equation]^ which we repeat here 
for clarity 
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The majority of the time is spent in the loop structure shown in Listing]^ The outer loop(n) iterates over 
rows of r as the computation of the projected primordial modes, Q, done by ^-triple by calculate_xint 
(which includes the r integration) is the most expensive operation. All pre-factors in L, and L are 
calculated and stored in z before we compute the inner product with all the late time modes (calculated by 
plijk) for the ^-triple. The constraints on the allowed ^-triples give rise to the triangular 11,12,13 loops 
(note the parity constraint in the 13 loop which ensures that the sum of the £i is even). 


3.2.1 Threading 

As with Modal2D, the original code is parallelised with MPI only over the n loop, and introducing threading 
via OpenMP is a necessary step to improving utilisation of KNC’s many-core architecture. However, in the 
3D variant there are multiple candidate loops that could be parallelised with OpenMP. The inner-most m loop 
best matches the approach taken in the 2D case, but would be an unwise choice here (since the calculation 
of X and z would not benefit from parallelisation). Threading the ^i^2^3 loops instead would be a better 
idea, if not for the fact that the loop space is triangular; threading only one of the loops would lead either to 
work imbalance or an iteration count too small to fully utilize all of the available cores. 
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Figure 2; Diagram explaining how we collapse the three loops into a single iteration space. 


Ideally we would like to flatten the whole space into a single loop, similar to the effect of 

OpenMP’s collapse(3) construct. Use of this construct is illegal here, since the number of iterations 
of the inner loops depends on the indices of the inner loops We experimented with many different 
approaches of achieving good load balance using standard OpenMP: spawning separate OpenMP tasks for 
groups of loop iterations; recursively partitioning the iteration space; and using dynamic scheduling with a 
threaded outer-loop. In all cases we found that the overheads of these methodologies was significant, the 
load imbalance was not necessarily adequately addressed, and cache behaviour suffered due to the inability 
to specify thread affinities. The Intel® Threading Building Blocks (TBB) library provides task-affinity 
functionality that may have assisted us with this last point, but we have two reasons for not using it; first, to 
keep Modal as a C (rather than C-n-) application; and second, to ensure that our threaded version did not 
depend heavily on non-standard language features and libraries. 

Instead, we carve up the iteration space manually across tasks, and assign contiguous “chunks” of loop 
iterations to each thread. This scheme (demonstrated in Figure]^ allows us to achieve very good load bal¬ 
ancing at the expense of a small one-time setup overhead. Note that this method is scalable beyond threads 
- we can manually carve up the total iteration space across MPI, then across sockets/coprocessors 

and finally across threads - which allows us to improve the multi-node scalability of the code. 

While restructuring the loop for parallelism, we brought the n loop inside of the f’if’ 2^3 nest to reduce the 
amount of redundant work per thread. However, this increases the amount of temporary storage required 
by each thread to store its own partial result of F,„„. The size of this storage increases by the square of the 
number of terms, and is (9(10 MB) for a typical problem size. Assuming 4 threads per core, the amount of 
temporary storage far exceeds the 512 KB capacity of KNC’s L2 cache, and we see performance degradation 
due to thrashing effects. If all the threads on one core collaborate to read and write to the same array, then 
the size of this temporary storage is reduced by a factor of 4 - while it still doesn’t fit in the cache, blocking 
the loop to compute B iterations between accesses can amortize the bandwidth overheads. 

Collaboration between threads in this way requires a “nested” parallelism model, with a 2-tier hierarchy 
of threads {e.g. C groups of 4 threads, where C is the number of cores). A naive implementation of nested 
parallelism in OpenMP is not practical here because the cost of creating and destroying new teams of 
threads for each iteration of the inner loop is prohibitively expensive. Instead, we adopt an approach where 
the threads in both tiers are spawned together, and work is assigned to threads based on a combination of 
their team id (i.e tid / 4) and local worker id (i.e. tid % 4) - all threads execute the same set of 
loop iterations as the other threads in their group, but execute for different eigenmodes inside the angular 
momentum loop. 

Although this manual nesting approach gives us a great deal of flexibility, it depends on a custom on- 
core barrier (such as the one shown in Listing to synchronize only the threads in a particular group. 
There are two reasons to use a local barrier instead of the global barrier provided by OpenMP: first, a global 
barrier scales 0(log2(4-C)), whereas a local barrier has a constant overhead of (9(4); and second, a global 
barrier can easily lead to accidental deadlocks, since all threads in all groups must be present for all barriers. 

3.2.2 Algorithmic Change 

The bottleneck left in the code after the loop modernisation is the integration routine called by calculate_xint. 
This is a ID integral of a function f{r) with respect to the radial direction along the line of sight, r from 
now back to the time of inflation. These functions are highly peaked at r « 14,000, around the time the 
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Listing 3: Example implementation of a fast, on-core, barrier. 

typedef __declspec ( align (64)) struct corebarrier_t 

{ 

volatile int barrier[2]; 
int padding [14]; 

} corebarrier.t ; 

typedef __declspec ( align (64)) struct barrier.t 


int 

barrier.id ; 

// 

which 

barrier 

to use this time (0 or 1) 


int 

core.tid ; 

// 

local 

thread 

id on the core 

( tid % 4) 


int 

flag [2]; 

// 

value 

to write this time (0 

or I ) 


int 

waitval ; 

// 

value 

to wait 

on if threads 

are writing 

Is 

corebarrier.t* me ; 

// 

corebarrier _t 

shared by all 

threads on 

the 


} barrier.t; 


core 


void cpu.pause () 

{ 

// pause, sleep, or delay this thread 

I 


void c o r e b ar ri er ( b a r r i e r _ t *bar) 

{ 

// determine which barrier and values to use 

int barrier.id = bar->barrier_id ; 

int core.tid = bar->core_tid ; 

int flag = bar ->flag [ barrier.id ]; 

int waitval = (flag) ? bar->waitval : 0; 

// loop until all threads have written to this barrier 
corebarrier.t *me = bar->me; 

((char * )&me-> b ar r i e r [ b ar r i e r _i d ]) [ c o r e _ti d ] = flag; 
while (me->barrier [ barrier.id ] != waitval) 
cpu.pause (); 

// toggle the barrie r / fla g for next time 
bar ->flag [ barrier.id ] = 1 - flag; 
bar-> barrier.id = 1 - barrier.id; 

1 
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Figure 3: Performance improvement of ModaOD from optimisation and modernisation. The numbers 
above each of the bars are the performance difference between a coprocessor and two processors for that 
version. 


CMB was formed, and are relatively flat everywhere else. To decrease computational cost and memory 
consumption, these functions are sampled with 3 different resolutions in r, with more points around the 
surface of last recombination, and less points in the flat regions. Integration was performed by hrst htting a 
spline to the data points and then integrating using the spline to interpolate points which are missing from 
the sample. In the original code this procedure is performed by the routine gsl_spline_eval_integ from 
the GSL, but was replaced with a faster equivalent from the Intel® MKL. We find that although the method 
from MKL has better vectorisation than the method from GSL, they both suffer from the same problems - 
they are too memory intensive (storing a cubic spline in addition to the data) and computationally expensive 
(solving a set of linear equations). 

We hnd that replacing this procedure with a simpler numerical integration routine can drastically speed¬ 
up the application even if it requires a greater number of sample points to achieve the same accuracy. We 
tried two integration methods for this - a simple application of the Trapezium rule and a method that uses 
Hermite Cubic spline interpolation. These two methods can compute the integral using local points only, 
unlike the GSL Spline which needs all points to compute the integral. Local methods are much more 
amenable to vectorisation and also requires (9(1) temporary storage. The Hermite spline integrator is based 
on the interpolation of r between points and r^+x using a Hermite cubic function: 

piru) = (2f^ - 3f^ -H l)yi H- {P - 2t^ + f)(r*+i - n)p'irk) + (-2t^ + 3t^)yk+i + (P - f^)p'(r*+i) (11) 

Where t - {r- rk)l{rk+\ - r^). The exact derivatives of y(r) are not available so we must approximate. Here 
we use a simple approximation of p'(r^.) = (yk+i - yk)l{xk+i - Xk). Integrating Equationwith respect to 
r and using this derivative yields the integrator: 

fdrm («* 4''* (S ■ Si;)) 

For the results reported here, we use a simple application of the Trapezium rule, combined with 216 
sampling points. This scheme has sufficient numerical accuracy for us to obtain a physically meaningful 
answer, within a few percent of the answer calculated by GSL. Other numerical integration routines (e.g. 
Gaussian quadrature) may be required for other corner cases, and we leave this investigation to future work. 

This leaves two bottlenecks in the code - the partial reduction stage in calculate_garama_3d, and the 
calculate_xint routine. The reduction step is actually just a matrix multiply - PimXi,,, and thus it 
can be replaced with a call to the BLAS-3 DGEMM routine from Intel® MKL, which is cache-blocked and 
vectorized efficiently out-of-the-box. We call MKL from only a single thread in each of our nested thread 
groups, and empirically derive the best blocking factor B to be 64. 

3.2.3 Results 

The graph in Figure shows the speed-up resulting from each of our optimisations and modernisations. 
For the original code following re-factoring and the introduction of threads (Version 1-3) we already see 
signihcant gains on the host (3.27x ). Note that we do not provide KNC results until the introduction of 
OpenMP (Version 4) since we are using the offload model, and offloading work to a single KNC thread does 
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Table 3: Execution times of ModalSD at various stages of code optimisation. 


Version 

2x SNB (s) 

lx KNC (s) 

Comment 

1 

2887.0 

- 

Original Code. 

2 

2610.0 

- 

Loop simplification. 

3 

882.0 

- 

MKL Integration. Remove getter methods. 

4 

865.9 

1991.6 

Flattened loops -H threads. 

5 

450.6 

667.9 

Loop reorder, manual nested threading. 

6 

385.6 

655.0 

Blocked version of loop. 

7 

46.9 

49.5 

Trapezium Integration. 

8 

37.4 

37.7 

Reduction with DGEMM. 

9 

35.1 

34.5 

Alignment-i-Padding of Arrays. 

10 

34.3 

26.6 

Software Prefetching. 


not make sense. Tuning the threading behaviour (Versions 5 and 6) continues to improve performance on 
both platforms but, as in the case of Modal2D, more significant speed-ups are only possible with algorithmic 
change - in this case, a change in integration method (Version 7). Further tuning of vectorisation behaviour 
(Versions 8 and 9) provide a small boost in performance, and tuning the prefetch distance (Version 10) 
finally pushes the performance of KNC ahead of SNB. (See Table]^) 

Since the same code can be recompiled and ran on both processor and coprocessor, very little effort was 
required to make the code heterogeneous. We use the asynchronous offload features of Intel® Language 
Extensions for Offload (LEO) to offload a fraction (empirically derived to be 71%) of the total problem to 
KNC, and process the remainder on SNB. On a single node of Cosmic, using KNC in addition to SNB gives 
a 3.5x speed-up over using SNB only. 

4 Performance Study 

The new implementations of the 2D and 3D variants of Modal are optimised but they are not optimal. In this 
section, we analyze and discuss the remaining bottlenecks to performance at both single- and multi-node 
scale. 

4.1 Integrator Performance: Accuracy vs. Execution Time 

In Modal3D, swapping out the spline-based integrator for a simpler one significantly accelerates the code on 
modern architectures, but at the price of reduce accuracy. We show that this accuracy can be both recovered 
and increased when using the simpler integrators by increasing the number of sampling points. Moreover, 
even with a larger sampling size the performance is still likely to be far better than the spline based approach 
using less points. 

Figure [4a| shows the percentage root mean squared error (RMSE) of the unit normalised T produced by 
Modal3D, for different integration methods and different numbers of sample points. We take the result of 
Modal3D with using the GSL Spline with 1768 sampling points to be the “gold standard” to compare all 
the others to. All three methods converge towards the gold standard with increasing sample points. The 
Hermite integrator has nearly identical error to the GSL spline and achieves an error of < 1 x 10“^% of the 
gold standard with the maximum number of points. The Trapezium integrator is the least accurate of the 
three, but still gets an error of < 1 x 10 "^% of the gold standard. If we consider the accuracies in light of 
the times to solution shown in Figure |4bl a clear picture emerges - even after increasing the accuracy of the 
Hermite and Trapezium integrators, they remain much faster than the GSL spline with a small number of 
points. The Trapezium integrator is the fastest of the three, but is only ~ 1.25x faster than the Hermite. 

4.2 Scaling with Cores 

Figures and show how performance of Modal2D and Modal3D scale with the number of cores on 
SNB and KNC, respectively. The number of threads per core on KNC is fixed at the number that gives the 
highest performance in each case - 2 and 4 threads per core for Modal2D and Modal3D, respectively. 

For Modal2D, we see that scaling tapers off with increasing core-count, reaching a maximum speed-up 
of 32.4x . This is not for want of parallel work, nor high synchronisation costs, but rather because the cores 
are competing for limited memory bandwidth. Running with all 59 cores on KNC, the bandwidth from 
GDDR to L2 was measured (by Intel VTune Amplifier XE) to be 147.5 GB/s on average - very close to its 
peak STREAM p2| bandwidth of 165 GB/s (see Table[T]|. The fact that we are hitting close to peak shows 
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(a) Percentage RMSE of P using different integrators and 
numbers of sampling points compared to the GSL Spline 
method with 1768 sampling points. 


(b) Runtimes of entire ModalSD calculation using dif¬ 
ferent integrators and varying the numbers of sampling 
points. 


Figure 4: Comparison of the accuracy and performance of the different integration methods used in 
ModaDD. 




(a) Modal2D (b) ModaBD 

Figure 5; Strong-scaling within a single node, for 601 modes. 
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(a) Modal2D (b) ModaBD 

Figure 6: Strong-scaling on multiple nodes, for 601 modes. 


that there is little room left to further tune the computation performed by Modal2D; any additional work on 
the 2D variant will need to focus on improving cache locality or algorithmic complexity, not vectorisation. 

For ModaOD, we see much better scaling with increasing core-count (close to linear). This is to be 
expected, since the code is very compute intensive. Furthermore the groups of threads, once spawned, 
require no communication with each other until the reduction stage at the very end of the calculation. 
The scaling behaviour suggests that, unlike the 2D variant, ModalSD is not bound by memory bandwidth. 
However, we are not yet instruction bound, either - running with all 59 cores on KNC, the number of vector 
instructions issued per cycle was measured by Speedometer | |25j to be 39.8% of peak. The performance of 
ModaDD is in fact limited by transfers between L2 and LI cache; although there is a large amount of data 
re-use within a core, the number of streams per thread is too high for the prefetchers to effectively hide the 
latency of L2 accesses. 

4.3 Scaling with Nodes 

Figures and show how performance scales with the number of Cosmic nodes for Modal2D and 
ModaOD, respectively. For ModaOD we show results for SNB and KNC alone, while for ModaOD we 
also show results for a “hybrid” model running 30% of the work on SNB and 70% of the work on KNC. In 
all cases, we place a single MPI rank per node. The reader is reminded that a single Cosmic node contains 
only a single SNB socket, and a single KNC coprocessor (see Table[T]). 

For both the 2D and 3D variants, we see scaling that is fairly close to linear. There are two reasons 
for this: first, the only communication required between tasks occurs right at the start of a run (to agree on 
task decomposition) and right at the end (to reduce the final gamma matrix); second, there is a significant 
amount of work (601 iterations in the 2D case, and 2x10® iterations in the 3D case) to be split between 
MPI ranks, so we do not reach a scale where communication costs begin to dominate execution time. Note 
that although the single-node scaling of Modal2D is inhibited by hardware limitations, this is not the case 
here - when scaling to multiple nodes, the total available memory bandwidth also scales accordingly. 

4.4 Scaling with Problem Size 

Figures|7^and|^show how performance scales with problem size for Modal2D and ModaBD, respectively. 
As in the previous section, we again report performance for three different configurations: SNB alone; KNC 
alone; and hybrid execution across both processor and coprocessor. Across all variants and problem sizes, 
KNC beats SNB. Focusing on the 2D variant, the gap between KNC and SNB is largest for large problems. 
For the two small problems with 50 and 101 terms, the execution time is dominated by the precompute stage 
of calculation which does not scale well with increasing numbers of threads, and which is thus relatively 
expensive on KNC. For the larger problems like the 601 and 1001, the execution time is dominated by the 
real computation of F,„„, which is highly scalable on KNC and thus KNC gains and even larger speed-up of 
4.lx over the SNB for the 1001 case. 

For the 3D variant, the gap in execution times between SNB and KNC is more consistent, owing to the 
large amount of parallelism present even in small problems as a result of our fine-grained decomposition of 
the loops. All the times in|^are for the same T’lT’af’s space of 18712695 iterations, but for a different number 
of terms. The amount of computation required in each of Modal3D’s functions scales at different rates with 
respect to the problem size, and this is reflected in the execution times. The number of eigenmodes that 
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(a) Modal2D 


(b) ModaBD 


Figure 7; Performance as a function of problem size. In (a) precompute times are stacked on top of the 
compute times. 


must be integrated scales linearly, but the size of the Gamma matrix and thus the overhead of reducing it 
scales quadratic ally. For all the problem sizes we have tested, integration remains the top hotspot, which is 
why the scaling is close to what one would expect, but the increasing cost of other functions is noticeable. 

5 Conclusions 

We have presented an optimisation study for the 2D and 3D variants of “Modal”, an early universe sim¬ 
ulation and analysis code. It is representative of two common computational challenges: evaluation of a 
multi-dimensional integral/sum both on a rectangular dense domain, and on a domain which is neither. The 
optimisation steps detailed here would be applicable to any similar code. 

Through a combination of algorithmic improvements, the introduction of thread-level parallelism, and 
exposing opportunities for hardware vectorisation, we have achieved significant whole application speed- 
ups: 1765x on KNC and 833x on 2x SNB in the 2D case; and 108x on KNC and 83.9x on 2x SNB in the 
3D case. In both cases, the greatest source of speed-up is algorithmic change, and the increased amount of 
exploitable parallelism it brings. Although still significant, hardware-specific tuning of the new algorithms 
yields less than lOx improvement in performance. 

Our use of standard programming languages ensures that code changes benefit not only the new Intel 
Xeon Phi coprocessors, but also Intel Xeon processors - and performance improvements are expected to 
persist on the next generation of processors and coprocessors (codenamed “Knights Landing”). Investing 
in code optimisation and modernisation today can deliver significantly greater gains than waiting for fu¬ 
ture advances in processor technology and will ensure that we maximise the science done on any given 
architecture. 
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A Physics of the Modal code 

A.l Theory 

We observe the temperature T of the cosmic microwave background (CMB) on a distant sphere and we can 
represent anisotropies in this temperature AT /T as 

AT ' 

(13) 

Im 

where T;m(n) are the usual spherical harmonics with multipoles {,m with —{< m < (. Most quantita¬ 
tive cosmology has developed using the two-point correlation or power spectrum defined from a 

average over the azimuthal multipole m 


Ct 


1 

2e +1 




(14) 


However, we are interested in new information from the three-point correlator or bispectrum, averaged over 
orientations as the CMB is assumed isotropic. 
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(15) 


This is the spherical respresentation of the 2D CMB bispectrum of triangles on the sky. One of our key 
goals is to connect this to the 3D primordial bispectrum B(ki,k 2 ,kj,) generated in the early universe; here, 
B{ki,k 2 ,k-i) is defined in terms of wavenumbers k, and must be projected forward using transfer functions 
describing the evolution of the Universe to predict the late-time 

The modal method is designed to constrain the bispectrum. If you integrate the bispectrum you get the 
skewness {i.e. how much the distribution leans to one side) of a distribution and, since the bispectrum is 
zero for a Gaussian distribution, this an effective test of non-Gaussianity. The issue is that the bispectrum is 
a full three dimensional quantity and so calculating and measuring it is nontrivial. If we wish to constrain a 
theory which predicts a bispectrum B{ki , k 2 , ko,) at the end of inflation, we first need to evolve it forward to 
today via convolution with transfer functions A: 


^fif2A - ^fif2A B(ki,k2,k2) f (n^^j A(^(ki)jeXrki)) drdkidk 2 dk 2 ■ (16) 


where /z is a geometric factor related to the projection onto a 2-sphere which will be defined below. This is a 
7-dimensional calculation and is impossible in practice. The modal method simplifies this by decomposing 
the inflationary bispectrum into a set of specially chosen basis functions for which this calculation can 
be dramatically simplified. These Q„ are defined as: 


Qn(kuk2,k2,) = ^ {qi(ki)qj(k2)qk(k3) + qj(ki)qk(k2)qi(k2) + qk(ki)qi{k2)qjih) 

+ qk{h)qj{k2)qi{h) + qj{h)qi{k2)qk{k2) + ^;(fci)ft(^2)?j(^3)) (17) 

where the relation between n to ijk is defined via a pre-determined one-to-one mapping which is opti¬ 
mised for convergence. The mapping both is non-analytic and sparse (so not all, or even most, ijk triples 
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correspond to a n) and is read in from a pre-calculated list. An example mapping would look like; 


n i j k 

0 ^000 
1 ^001 
2 ^011 
3^111 

4 ^002 

5 -) 0 1 2 


(18) 


With this basis we then have B' - ik\k 2 k^)^B - Ym^nQn where the B' is the signal to noise weighted 
version of B. The particular form of the basis function allows the projection to be calculated simply and we 
have; 


QnC\£2^'i — ^ / P'dr {^i{r, (\)qj{r, ( 2 )qk{r, ti) + 5 permutations) (19) 

~qiM = Jdkk%(k)Adk)M/cr) (20) 

and now Bc^(^(^ = d„Qn and the convolution is effectively 2-dimensional. However this form still proves 
difficult to use for estimation because of the radial integral r (i.e. distance from inflation to now, along a 
line of sight). It is more efficient to use a second basis at the time of observation; 

Qntii2ti - g {qi(.(\)qj(.i2)qk{(i) + qj{(i)qk{(2)qi{(2) + qk{(\)qi{(2)qi{(2) 

+ qk{i\)qj{(2)qi{(2) + qj{(i)qi{(2)qk{i2) + qi{(\)qk(h)qj{h)), ( 21 ) 

and to project a signal to noise weighted version of the Q into this new basis, so that we have 

<2« = X = Z r,™ ™ Q„, (22) 

m m 

where the v, = {2£ -H 1)'^®. If we define the inner product (which is designed to mimic the signal to noise 
structure of the estimator) as; 


(A, B)t = y \ At,e2t2 Bt,e2t2 , (23) 

^ Wve,ve,l 

where h the previously mentioned geometric factor defined by 

, 2 (2A + m(2 + 1)(2^3 + 1) / a {2 h 

Kt2^2- -( 0 0 0 ) ■ 

Using this we can then define T in terms of the inner product as; 

= ( 25 ) 

r 

Thus for optimising the calculation of T we only need to focus on optimising the evaluation of the inner 
product. The majority of the calculation time is in evaluation the second inner product due to the radial 
integral inherent in Q, so we will restrict our attention to that part alone defining 

K,n-{Qr,Qn) (26) 

which is Equation]^ The reader is referred to pT| and fT^ for a full explanation of the physics. 
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